── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate) # Deal with dates
library(mosaic)
Registered S3 method overwritten by 'mosaic':
method from
fortify.SpatialPolygonsDataFrame ggplot2
The 'mosaic' package masks several functions from core packages in order to add
additional features. The original behavior of these functions should not be affected by this.
Attaching package: 'mosaic'
The following object is masked from 'package:Matrix':
mean
The following objects are masked from 'package:dplyr':
count, do, tally
The following object is masked from 'package:purrr':
cross
The following object is masked from 'package:ggplot2':
stat
The following objects are masked from 'package:stats':
binom.test, cor, cor.test, cov, fivenum, IQR, median, prop.test,
quantile, sd, t.test, var
The following objects are masked from 'package:base':
max, mean, min, prod, range, sample, sum
# Loads all the core timeseries packages, see messages
# devtools::install_github("FinYang/tsdl")
library(tsdl) # Time Series Data Library from Rob Hyndman
library(tsbox) # "new kid on the block"
library(TSstudio) # Easy Plots, Decompositions, and Modelling with Time Series
Attaching package: 'TSstudio'
The following object is masked from 'package:tsbox':
ts_plot
Introduction
Any metric that is measured over regular time intervals forms a time series. Analysis of Time Series is commercially important because of industrial need and relevance, especially with respect to Forecasting (weather data, sports scores, population growth figures, stock prices, demand, sales, supply, ...). For example, the graph below shows the temperatures over time in two US cities:
What can we do with Time Series? A time series can be broken down to its components so as to systematically understand, analyze, model and forecast it. As with other datasets, we have to begin by answering fundamental questions, such as:
What are the types of time series?
How do we visualize time series?
How do we decompose the time series into level, trend, and seasonal components?
How might we make a model of the underlying process that creates these time series?
How do we make useful forecasts with the data we have?
We will first look at the multiple data formats for time series in R. Alongside we will look at the R packages that work with these formats and create graphs and measures using those objects. We will then look at obtaining the components of the time series and try our hand at modelling and forecasting.
Time Series Data Formats
There are multiple formats for time series data. The ones that we are likely to encounter most are:
The tibble format: the simplest and most familiar data format is of course the standard tibble/dataframe, with a time column/variable to indicate that the other variables vary with time. The standard tibble object is used by many packages, e.g. timetk and modeltime.
The ts format: We may simply have a single series of measurements made over time, stored as a numerical vector. The stats::ts() function converts a numeric vector into an R time series ts object, which is the most basic time series object in R. The base-R ts object is used by established packages such as forecast and is also supported by newer packages such as tsbox.
The modern tsibble format: this is a newer format for time series analysis. The special tsibble object ("time series tibble") is used by fable, feasts and others from the tidyverts set of packages.
There are many other time-oriented data formats too (probably too many), such as tibbletime and TimeSeries objects. For now, the best way to deal with these, should you encounter them, is to convert them to a tibble or tsibble (using, say, tsbox) and work with those.
Figure: Standards
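Converting between these formats can be sketched as follows. This is a minimal example, assuming the tsbox and tsibble packages are installed; the built-in AirPassengers series is used only as a convenient ts object, and the object names (air_tbl, air_tsbl, air_ts) are made up for illustration.

```r
library(tsbox)
library(tsibble)

# AirPassengers ships with base R as a monthly ts object
air_tbl  <- ts_tbl(AirPassengers)       # ts -> tibble with time and value columns
air_tsbl <- as_tsibble(AirPassengers)   # ts -> tsibble with a monthly index
air_ts   <- ts_ts(air_tbl)              # tibble -> back to a ts object

class(air_tbl)
class(air_tsbl)
class(air_ts)
```

The tsbox functions follow a ts_<target>() naming pattern, so the same idea should carry over to the other classes that tsbox supports.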
Creating and Plotting Time Series
We will first use simple ts data, and then work with data in tibble format that we can plot as is. After that, we will convert a tibble to tsibble format, and finally look at a dataset that is a tsibble from the ground up.
Base-R ts format data
There are a few datasets in base R that are in ts format already.
One can see that there is an upward trend, along with seasonal variations that increase over time.
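The dataset behind this observation is not shown here. As an assumption, the built-in AirPassengers series is one base-R ts dataset that displays exactly this kind of pattern, so a minimal sketch with it looks like this:

```r
# AirPassengers is an assumption: the text does not name the dataset it plotted
str(AirPassengers)         # a ts object with 144 monthly observations
frequency(AirPassengers)   # number of observations per year
plot(AirPassengers, ylab = "Passengers (thousands)")
```

Other ts datasets in the datasets package, such as nottem or UKgas, can be inspected and plotted the same way.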
Let us take data that is "time oriented" but not in ts format. We use the command ts() to convert a numeric vector to ts format. The main arguments of ts() are:
ts(data, start, end, frequency)
data : the numeric vector (or matrix) of observations
start : represents the first observation in time series
end : represents the last observation in time series
frequency : represents the number of observations per unit of time. For example, 1 = annual, 4 = quarterly, 12 = monthly, 52 = weekly (or 7 for daily data with a weekly cycle), etc.
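As a small illustration of these arguments, here is a sketch that turns eight made-up quarterly values into a ts object. The numbers and the name sales_ts are invented for the example.

```r
# Eight hypothetical quarterly observations, starting in Q1 of 2021
sales <- c(12, 15, 14, 18, 21, 19, 23, 25)
sales_ts <- stats::ts(sales, start = c(2021, 1), frequency = 4)

print(sales_ts)    # printed as a year-by-quarter table
start(sales_ts)    # first time point: year 2021, quarter 1
end(sales_ts)      # last time point, derived from start and frequency
time(sales_ts)     # the time stamp attached to each observation
```

Note that end does not need to be given: ts() works it out from start, frequency, and the length of the data.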
We will pick simple numerical data that is not yet a time series: the weight column of the ChickWeight dataset.
ChickWeight %>% head()
# Filter for Chick #1 and for Diet #1
ChickWeight_ts <- ChickWeight %>%
  filter(Chick == 1, Diet == 1) %>%
  select(weight, Time)
ChickWeight_ts <- stats::ts(ChickWeight_ts$weight, frequency = 2)
str(ChickWeight_ts)
Time-Series [1:12] from 1 to 6.5: 42 51 59 64 76 93 106 125 149 171 ...
Now we can plot this in many ways:
plot(ChickWeight_ts) # Using base-R
# ts_boxable(ChickWeight_ts)
tsbox::ts_plot(ChickWeight_ts, ylab = "Weight of Chick #1") # Using tsbox
TSstudio::ts_plot(ChickWeight_ts, Xtitle = "Time", Ytitle = "Weight of Chick #1") # Using TSstudio
tibble data
Using the familiar tibble structure opens up new possibilities. We can have multiple time series within a tibble (think GDP, Population, Imports, Exports for multiple countries, as with the gapminder data, https://www.gapminder.org/data/, that we saw earlier). It also allows for data processing with dplyr, such as filtering and summarizing.
Let us read in and inspect the US births data from 2000 to 2014. Download this data by clicking on the icon below, and save the downloaded file in a sub-folder called data inside your project.
So... average births per month were higher in 2005 to 2007 and have dropped since. We can do similar graphs using day_of_week as our basis for grouping, instead of month:
births_2000_2014 %>%
  mutate(
    # So that we can have discrete colours for each week day
    # Using base::factor()
    # Could use forcats::as_factor() also
    day_of_week = base::factor(day_of_week,
      levels = c(1, 2, 3, 4, 5, 6, 7),
      labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
    )
  ) %>%
  group_by(year, day_of_week) %>%
  summarise(mean_weekly_births = mean(births, na.rm = TRUE)) %>%
  gf_line(mean_weekly_births ~ year,
    group = ~day_of_week,
    colour = ~day_of_week, data = .
  ) %>%
  gf_point() %>%
  # palette for 12 colours
  gf_theme(scale_colour_brewer(palette = "Paired")) %>%
  gf_theme(theme_classic())
Looks like an interesting story here... there are significantly fewer births on average on Sat and Sun, over the years! Why? Should we watch Grey's Anatomy?
So far we are simply treating the year/month/day variables as simple numerical variables. We have not created an explicit time or date variable. Let us do that now:
So there are several numerical variables for year, month, and day_of_month, day_of_week, and of course the births on a daily basis. tsbox::ts_plot needs just the date and the births column to plot with and not be confused by the other numerical columns, so let us create a time column from these three, but retain them for now. We use the lubridate package from the tidyverse:
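Building the date column can be sketched as below. This is a minimal example on a toy tibble, since the births file itself is not reproduced here: the column names follow the text (year, month, day_of_month, births), the values are invented, and births_toy is a hypothetical name.

```r
library(dplyr)
library(lubridate)

# Toy stand-in for the births data; the counts are made up for illustration
births_toy <- tibble::tibble(
  year         = c(2000, 2000, 2014),
  month        = c(1, 2, 12),
  day_of_month = c(1, 29, 31),
  births       = c(9000, 8000, 8500)
)

# Combine the three numeric columns into one Date column, keeping the originals
births_toy <- births_toy %>%
  mutate(date = make_date(year = year, month = month, day = day_of_month))

births_toy
```

Once a proper Date column exists, a plotting function only needs date and births, and the remaining numeric columns can be dropped with select() when they get in the way.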
If we need to set up average monthly and weekly births as before, we need to understand more about data processing with time series, similar to what dplyr does for tibbles. We will do this shortly, using tsibble.
tsibble data
Finally, we have tsibble ("time series tibble") format data, which contains three main components:
an index variable that defines time;
a set of key variables, usually categorical, that define sets of observations, over time. This allows for each combination of the categorical variables to define a separate time series.
a set of quantitative variables, that represent the quantities that vary over time (i.e., over the index).
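The three components can be seen in a tiny hand-made tsibble. This is only a sketch with invented data: month is the index, region is the single key, and units is the quantitative variable; none of these names come from a real dataset.

```r
library(tsibble)

# Two regions (key) observed over three months (index), with made-up unit counts
sales <- tsibble(
  month  = yearmonth("2023 Jan") + rep(0:2, times = 2),
  region = rep(c("North", "South"), each = 3),
  units  = c(10, 12, 9, 20, 18, 25),
  key    = region,
  index  = month
)

sales
index_var(sales)   # name of the index variable
key_vars(sales)    # names of the key variables
n_keys(sales)      # number of separate time series
```

Each distinct value of the key defines its own series, so this object holds two monthly series of three observations each.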
Here is Robert Hyndman's video introducing tsibbles:
The package tsibbledata contains several ready-made datasets in tsibble format. Let us try PBS, which is a dataset containing monthly Medicare prescription data in Australia.
Run data(package = "tsibbledata") in your Console to find out about these.
data("PBS")
PBS
This is a large-ish dataset:
67K observations
336 combinations of key variables (Concession, Type, ATC1, ATC2) which are categorical, as foreseen.
Data appears to be monthly, as indicated by the 1M.
the time index variable is called Month
Note that there are multiple Quantitative variables (Scripts, Cost), which a single univariate ts series cannot hold, but a tsibble can. The Qualitative Variables are described below.
Type help("PBS") in your Console.
The data is dis-aggregated/grouped using four keys:
Concession: Concessional scripts are given to pensioners, unemployed, dependents, and other card holders
Type: Co-payments are made until an individual's script expenditure hits a threshold ($290.00 for concession, $1141.80 otherwise). Safety net subsidies are provided to individuals exceeding this amount.
ATC1: Anatomical Therapeutic Chemical index (level 1). 15 types
ATC2: Anatomical Therapeutic Chemical index (level 2). 84 types, nested inside ATC1
Let us simply plot Cost over time:
PBS %>%
  gf_point(Cost ~ Month, data = .) %>%
  gf_line() %>%
  gf_theme(theme_classic())
This basic plot is quite messy, because all 336 key combinations are drawn together.
tsibble has dplyr-like functions
We can use dplyr functions such as mutate(), filter(), select() and summarise() to work with tsibble objects. tsibble itself has no function for filtering on categorical variables; that needs to be done with dplyr::filter().
However, tsibble has specialized functions to do with the index (i.e time) variable and the key variables, things similar to what dplyr does.
Let us first see how many observations there are for each combo of keys:
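One way to count the observations per key combination is sketched below, assuming the tsibbledata package is installed. The tsibble is first turned into a plain tibble so that dplyr::count() tallies rows per key; dplyr::count() is called with its namespace because mosaic masks count(). The names key_counts and n_obs are made up for this example.

```r
library(dplyr)
library(tsibble)
library(tsibbledata)

# Count rows for every combination of the four key variables
key_counts <- PBS %>%
  as_tibble() %>%
  dplyr::count(Concession, Type, ATC1, ATC2, name = "n_obs")

key_counts
n_keys(PBS)                      # number of key combinations in the tsibble
dplyr::count(key_counts, n_obs)  # how many combinations have each series length
```

The last line gives a quick view of which series are shorter than the rest.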
We have 336 combinations of Qualitative variables, each combo containing 204 observations (except some!): so let us filter for a few such combinations and plot:
PBS %>%
  tsibble::group_by_key(ATC1, ATC2, Concession, Type) %>%
  gf_line(Cost ~ Month, colour = ~Type, data = .) %>%
  gf_point() %>%
  gf_theme(theme_classic())

# For a specific combo of Qual variables (keys)
PBS %>%
  dplyr::filter(Concession == "General", ATC1 == "A", ATC2 == "A10") %>%
  gf_line(Cost ~ Month, colour = ~Type, data = .) %>%
  gf_point() %>%
  gf_theme(theme_classic())
As can be seen, the two Types of payment methods show very different time patterns. Both are strongly seasonal, with seasonal variation increasing over the years, but there is a much stronger upward trend with the Co-payments method of payment.
We can use tsibble's dplyr-like commands to develop summaries by year, quarter, and month (the original data). Look carefully at the new time variable created each time:
# Cost Summary by Year
PBS %>%
  tsibble::group_by_key(ATC1, ATC2, Concession, Type) %>%
  index_by(year(Month)) %>%
  summarise(mean = mean(Cost, na.rm = TRUE))

# Cost Summary by Quarter
PBS %>%
  tsibble::group_by_key(ATC1, ATC2, Concession, Type) %>%
  tsibble::index_by(yearquarter(Month)) %>%
  dplyr::summarise(mean = mean(Cost, na.rm = TRUE))

# Cost Summary by Month, which is the original data
# Only grouping happens here
PBS %>%
  tsibble::group_by_key(ATC1, ATC2, Concession, Type) %>%
  index_by() %>%
  summarise(mean = mean(Cost, na.rm = TRUE))

# Original Data
PBS
Finally, it may be a good idea to convert some tibble into a tsibble to leverage some of functions that tsibble offers:
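A minimal sketch of such a conversion follows, using invented daily data rather than the births file. The names daily, daily_tsbl and monthly are hypothetical, and the counts are synthetic.

```r
library(dplyr)
library(tsibble)

# Hypothetical daily data: two years of made-up counts
daily <- tibble::tibble(
  date   = seq(as.Date("2000-01-01"), as.Date("2001-12-31"), by = "day"),
  births = 10000 + (seq_along(date) %% 7) * 100
)

# Declare which column is the time index; there are no key variables here
daily_tsbl <- as_tsibble(daily, index = date)
index_var(daily_tsbl)
interval(daily_tsbl)   # tsibble detects the spacing of the index

# Re-index by month and average within each month
monthly <- daily_tsbl %>%
  index_by(month = yearmonth(date)) %>%
  summarise(mean_births = mean(births, na.rm = TRUE))

monthly
```

After index_by(), summarise() produces one row per month, and the result is again a tsibble whose index is the new month variable.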
This is DAILY data of course. Let us say we want to group by month and plot mean monthly births as before, but now using tsibble and the index variable:
# Why not use dplyr group_by for tsibbles?
births_tsibble %>%
  dplyr::group_by(year) %>%
  # This grouping does not give a proper result
  # The grouping by `index` is different
  # Annual Birth Average as before
  summarise(mean_births = mean(births, na.rm = TRUE))
# Should give 15 rows but does not!
# The original dataset does, however.

births_tsibble %>%
  tsibble::index_by(year) %>%
  dplyr::summarise(mean_births = mean(births, na.rm = TRUE))
# 15 rows, one for each year
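The difference between the two groupings can be checked on a small synthetic tsibble. This is a sketch with invented data: toy, by_group and by_index are hypothetical names. The point it illustrates is that summarise() on a tsibble never collapses the index, so a dplyr group_by() on a non-index column still leaves one row per date.

```r
library(dplyr)
library(tsibble)

# Two years of made-up daily counts, with a plain numeric year column
toy <- tibble::tibble(
  date   = seq(as.Date("2000-01-01"), as.Date("2001-12-31"), by = "day"),
  year   = as.integer(format(date, "%Y")),
  births = 10000 + (seq_along(date) %% 7) * 100
) %>%
  as_tsibble(index = date)

# dplyr grouping: the daily index is kept, so nothing is aggregated over time
by_group <- toy %>%
  dplyr::group_by(year) %>%
  summarise(mean_births = mean(births))

# tsibble grouping on the index: one row per year
by_index <- toy %>%
  index_by(yr = as.integer(format(date, "%Y"))) %>%
  summarise(mean_births = mean(births))

nrow(by_group)
nrow(by_index)
```

Comparing the two row counts shows why index_by() is the tool to reach for whenever the aggregation is over time.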
Candle-Stick Plots
Hmm... can we try to plot boxplots over time (Candle-Stick Plots)? Over month / quarter or year?
Monthly Box Plots
# Monthly box plots
births_tsibble %>%
  index_by(month_index = ~ yearmonth(.)) %>% # 15 years
  # No need to summarise, since we want boxplots per year / month
  gf_boxplot(births ~ date,
    group = ~month_index,
    fill = ~month_index, data = .
  ) %>% # plot the groups
  # 180 plots!!
  gf_theme(theme_minimal())
Quarterly boxplots
births_tsibble %>%
  index_by(qrtr_index = ~ yearquarter(.)) %>% # 60 quarters over 15 years
  # No need to summarise, since we want boxplots per year / month
  gf_boxplot(births ~ date,
    group = ~qrtr_index,
    fill = ~qrtr_index,
    data = .
  ) %>% # 60 plots!!
  gf_theme(theme_minimal())
Yearwise boxplots
births_tsibble %>%
  index_by(year_index = ~ lubridate::year(.)) %>% # 15 years, 15 groups
  # No need to summarise, since we want boxplots per year / month
  gf_boxplot(births ~ date,
    group = ~year_index,
    fill = ~year_index, data = .
  ) %>% # plot the groups 15 plots
  gf_theme(scale_fill_distiller(palette = "Spectral")) %>%
  gf_theme(theme_minimal())
Although the graphs are very busy, they do reveal seasonality trends at different periods.
Seasons, Trends, Cycles, and Random Changes
The different types of patterns in time series are as follows:
Trend: A trend exists when there is a long-term increase or decrease in the data. It does not have to be linear. Sometimes we will refer to a trend as "changing direction", when it might go from an increasing trend to a decreasing trend.
Seasonal: A seasonal pattern occurs when a time series is affected by seasonal factors such as the time of the year or the day of the week. Seasonality is always of a fixed and known period. The monthly sales of drugs (with the PBS data) show seasonality which is induced partly by the change in the cost of the drugs at the end of the calendar year.
Cyclic: A cycle occurs when the data exhibit rises and falls that are not of a fixed frequency. These fluctuations are usually due to economic conditions, and are often related to the "business cycle". The duration of these fluctuations is usually at least 2 years.
The function feasts::STL allows us to create these decompositions.
Let us try to find and plot these patterns in Time Series.
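As a rough sketch of such a decomposition, the example below uses base R's stats::stl() on the built-in AirPassengers series. This is a swapped-in stand-in, not the feasts::STL() workflow named above; it is chosen here only because it runs without extra packages. The names air_log, fit and parts are made up.

```r
# Log-transform so that the growing seasonal swings become roughly additive
air_log <- log(AirPassengers)

# Seasonal-trend decomposition using loess, with a fixed seasonal pattern
fit <- stats::stl(air_log, s.window = "periodic")

# The three components sit in one multi-column ts object
parts <- fit$time.series
colnames(parts)   # seasonal, trend, remainder
head(parts)

# Four stacked panels: data, seasonal, trend, remainder
plot(fit)
```

The seasonal, trend and remainder columns add up to the (logged) series, which is what makes the decomposition additive.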
We have seen a good few data formats for time series, and how to work with them and plot them. We have also seen how to decompose time series into periodic and aperiodic components, which can be used to make business decisions.
In the Tutorial @sec-slides-and-tutorials, we will explore modelling and forecasting of time series.
Your Turn
Choose some of the datasets in the tsdl and in the tsibbledata packages. Plot basic, filtered and model-based graphs for these and interpret.
References
Robert Hyndman, Forecasting: Principles and Practice (Third Edition). Available online.